Abstract

The wealth of available genomic data presents an unrivaled opportunity to study the molecular basis of evolution. Studies 
on gene family expansions and site-dependent analyses have already helped establish important insights into how proteins 
facilitate adaptation. However, efforts to conduct full-scale cross-genomic comparisons between species are challenged by 
both growing amounts of data and the inherent difficulty in accurately inferring homology between deeply rooted species. 
Proteins, in comparison, evolve by means of domain rearrangements, a process more amenable to study given the strength of 
profile-based homology inference and the lower rates with which rearrangements occur. However, adapting to a constantly 
changing environment can require molecular modulations beyond reach of rearrangement alone. Here, we explore rates and 
functional implications of novel domain emergence in contrast to domain gain and loss in 20 arthropod species of the pan- 
crustacean clade. Emerging domains are more likely disordered in structure and spread more rapidly within their genomes 
than established domains. Furthermore, although domain turnover occurs at lower rates than gene family turnover, we find 
strong evidence that the emergence of novel domains is foremost associated with environmental adaptation such as abiotic 
stress response. The results presented here illustrate the simplicity with which domain-based analyses can unravel key players 
of nature's adaptational machinery, complementing the classical site-based analyses of adaptation. 
Introduction

Since eukaryotic genomes are sequenced at an ever- 
increasing pace, comparative genomics has become an 
indispensable approach in many areas of molecular bio- 
sciences. One major goal is to understand, from a molecular 
perspective, how adaptation, development, and speciation 
have come about. However, automated functional inter- 
pretation of evolutionary traits in molecular terms is still a 
daunting task: accurate genome-scale de novo predictions 
of gene and protein structure as well as function are far from 
feasible. Moreover, many predicted protein-coding genes
are "orphans" that lack detectable homology to known proteins,
yet may well be key players in the process of adaptation
(Khalturin et al. 2009; Johnson and Tsutsui 2011).

However, by considering the modularity of protein 
evolution, valuable insights into the evolutionary forces 
shaping the functional make up of genomes have been 
obtained (Chothia et al. 2003; Pasek et al. 2005; Moore et al. 
2008; Buljan et al. 2010). A key insight to start with is the 
observation that the overall number of novel, that is, of pre- 
viously unreported, domains seems to converge, whereas 
the number of known modular arrangements of these do- 
mains is still rapidly expanding (Levitt 2009). Domains are 
the functional and structural constituents of proteins. They 
are evolutionarily well conserved across taxa (Elofsson and
Sonnhammer 1999; Finn et al. 2010) but frequently rear- 
ranged between and within proteins and genomes (Moore 
et al. 2008). These rearrangements can be observed independently
of whether domains are defined from a structural
perspective (see, e.g., Apic et al. 2001; Wang and Caetano-Anolles
2009) or an "implicit" evolutionary perspective, that
is, by comparing sequence fragments that are conserved
across many taxa (Bjorklund et al. 2005; Ekman et al. 2005).
The events underlying domain rearrangements are dupli- 
cation, fusion, and fission (Kummerfeld and Teichmann 
2005; Pasek et al. 2005) as well as terminal domain loss 
(Bjorklund et al. 2005; Weiner et al. 2006; Buljan et al. 
2010). These events are likely fueled by a series of underly- 
ing genetic events such as nonallelic homologous recombi- 
nation, nonhomologous end joining, transposition events, 
or combinations thereof. Eukaryotic proteomes contain a
larger proportion of multidomain proteins than bacteria
and archaea (Apic et al. 2001; Ekman et al. 2005), and some
studies concentrating on smaller clades found that rear- 
rangement rates differ between kingdoms (Ekman et al. 
2007). 

The ability to reuse is a hallmark of modular design, 
and the rearrangement of existing domains is more fre- 
quent than the formation of novel domains (Apic et al. 
2001). Ergo, it seems likely that functional novelty, such 
as required in the wake of environmental shifts, can be 
generated by modular rearrangements as opposed to the 
formation of novel domains. However, there is evidence 
that rearrangements of intact domains do not strongly 
alter arrangement functionality (Tjoelker et al. 2000; Koide 
2009), whereas effects such as modified binding affinity or 
substrate specificity may result (Yu and Lutz 2011). Consequently,
certain molecular innovations, such as those required for
the adaptation to new environments, may be out of reach
by rearrangement alone and may be instead facilitated by 
the emergence of novel domains. Indeed, change can be ob- 
served not only in the arrangements present in a genome 
but also in domain content (Itoh et al. 2007). For example, 
more than half of the domains present in Homo sapiens originate
before the metazoan era; only ~2% originate in H. sapiens
(Pal and Guda 2006). Although these turnover rates of
domains across proteomes seem low, they nonetheless al- 
low for comparative analyses from which phylogenies can be 
reconstructed (Bjorklund et al. 2005; Yang et al. 2005; Wang
and Caetano-Anolles 2006) and can be qualitatively related
to functional classes (Pal and Guda 2006; Itoh et al. 2007;
Zmasek and Godzik 2011).

These findings suggest that, albeit rare, novel domains 
may emerge as a result of functional challenges not met 
by modular rearrangements; such novel domains may con- 
fer a high adaptive potential. Accordingly, we here ask how 
frequently domain families are gained and lost and, in par- 
ticular, how frequently novel domain families emerge and 
whether such new families confer new functionalities. We 
address these questions in the pancrustacean clade as it is 
densely covered with well-annotated genomes representing 
species splits ranging from 1.2 to ~450 My. Furthermore,
the pancrustacean clade incorporates species with a wide 
range of adaptational diversity including both cosmopolitan 
generalists and geographically restricted specialists. Given 
that evolutionary analyses are not confounded by whole- 
genome duplications and that the overall topology of the 
species tree is well established (Meusemann et al. 2010), the 
pancrustacean clade provides an excellent data set to study 
the dynamics of domain turnover across proteomes. 

The approach taken here may aid the functional analy- 
sis of future genome and proteome projects as it exploits 
the high precision of profile-based domain detection and is 
complementary to methods using site-based sequence anal- 
ysis and turnover of gene families. 

Methods 

Proteomes and Annotation 

Due to the high density of available genomes within 
the clade, we chose to analyze domain emergence 
within pancrustacea. We used the predicted peptides 
of the 12 Drosophila species (Drosophila Genome 
Consortium 2007): Drosophila simulans (r1.3), D. sechellia
(r1.3), D. melanogaster (r5.11), D. yakuba (r1.3), D. erecta 
(r1.3), D. ananassae (r1.3), D. pseudoobscura (r2.3), D. 
persimilis (r1.3), D. willistoni (r1.3), D. mojavensis (r1.3), 
D. virilis (r1.2), and D. grimshawi (r1.3). The proteomes 
were obtained from FlyBase. We complemented the
Drosophila data set with the proteomes of the three 
mosquitoes Anopheles gambiae (P3.49), Culex pipiens (1.2), 
and Aedes aegypti (L1.49) (obtained from VectorBase); 
the moth Bombyx mori (1.0, obtained from the Silkworm 
Genome Database); the beetle Tribolium castaneum 
(51,906, obtained from BeetleBase); the two hymenoptera 
Nasonia vitripennis (1.2, obtained from the Baylor
College of Medicine/Human Genome Sequencing Center
(BCM/HGSC)) and Apis mellifera (4.0, obtained from
BCM/HGSC); and the crustacean Daphnia pulex (060905,
obtained from the Joint Genome Institute). As outgroups
we used the proteomes of H. sapiens (NCBI36.51, obtained
from GenBank) and Caenorhabditis elegans (WS206,
obtained from WormBase). We chose these outgroups 
in order to identify old domains that are common to a 
wide range of taxa and hence cannot be specific to the 
pancrustacean clade; the use of outgroups that are only 
distantly related to the species considered reduces the 
number of pancrustacea-specific domain candidates. For 
the complete tree including outgroups, see supplementary 
figure 3, Supplementary Material online. 

Proteomes were scanned using the pfamscan utility and 
HMMER 3.0 against Pfam-A and B domain models obtained 
from Pfam (v.24) (Finn et al. 2010). For Pfam-A, we em- 
ployed the curated, model-defined gathering threshold as 
bit score cutoff. For Pfam-B, we chose an E value cutoff
of 10⁻³, similar to previous studies (Ekman et al. 2007).
If multiple transcripts were present, we removed all but 
the longest splice variant. The domain residue coverage is 
roughly 50% for each proteome; roughly 76% of all proteins 
had at least one domain. Due to the domain-centric view
employed in this study, we discarded proteins that lack do- 
main annotation. 
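
The filtering rules described above (longest splice variant per gene, gathering threshold for Pfam-A, E value cutoff for Pfam-B, and removal of proteins without domains) can be illustrated with a minimal Python sketch. The record types, field names, and the inclusive comparison at the cutoff are assumptions made for illustration; they do not reproduce the pfamscan output format or the authors' pipeline.

```python
from collections import namedtuple

# Hypothetical record types; the field names are illustrative only and do not
# mirror the pfamscan output format.
Protein = namedtuple("Protein", "gene_id transcript_id sequence")
Hit = namedtuple(
    "Hit", "protein_id domain source bit_score evalue gathering_threshold"
)

PFAM_B_EVALUE_CUTOFF = 1e-3  # cutoff stated in the text for Pfam-B


def longest_isoforms(proteins):
    """Keep only the longest splice variant per gene (first one wins ties)."""
    best = {}
    for p in proteins:
        current = best.get(p.gene_id)
        if current is None or len(p.sequence) > len(current.sequence):
            best[p.gene_id] = p
    return list(best.values())


def passes_threshold(hit):
    """Pfam-A: bit score vs. gathering threshold; Pfam-B: E value cutoff.

    Treating both comparisons as inclusive is an assumption.
    """
    if hit.source == "Pfam-A":
        return hit.bit_score >= hit.gathering_threshold
    return hit.evalue <= PFAM_B_EVALUE_CUTOFF


def annotated_proteins(proteins, hits):
    """Return {transcript_id: [domain, ...]} for proteins with >= 1 accepted hit.

    Proteins without any accepted domain never enter the result, which
    corresponds to discarding proteins that lack domain annotation.
    """
    kept = {p.transcript_id for p in longest_isoforms(proteins)}
    annotation = {}
    for h in hits:
        if h.protein_id in kept and passes_threshold(h):
            annotation.setdefault(h.protein_id, []).append(h.domain)
    return annotation
```

With such a mapping in hand, every later step only needs the set of domains per proteome, not the underlying sequences.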

Ancestral Domain Contents: Domain Gain, Loss, and 
Emergence 

We used Dollo parsimony (Farris 1977) for prediction of an- 
cestral domain contents. The assumption underlying the 
use of Dollo parsimony is that domains are gained only 
once and that the number of losses required to explain domain
contents at nodes is minimized. Under Dollo, domain gain 
events will tend to occur early and will be offset by a large 
number of domain loss events. However, we consider Dollo 
parsimony as used here sufficiently robust. First, in this study 
we do not consider copy number variation; we consider only 
the binary state, presence or absence, of a given domain in 
any given node. Hence, a domain can only be lost along a 
branch if 1) it has been gained at an ancestral node to the 
branch considered and 2) not a single copy is present in 
the descendant node (or its subtree). Second, in most cases, 
domains represent the functional unit within a given pro- 
tein. As horizontal transfer of genetic material within eu- 
karyotes can at least be considered rare, gain events of such 
functional modules would imply de novo formation. Finally, 
the danger of overestimating loss events is larger, the more 
deeply the tree is rooted. Here, we use a shallow and densely 
populated tree dating back only 430 My. Other studies have 
successfully applied Dollo parsimony to considerably larger data sets
(Zmasek and Godzik 2011). Hence, we feel that the assumptions
underlying Dollo parsimony are reasonable within the
framework of this study. 
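
To make the reconstruction concrete, the following is a minimal Python sketch of Dollo parsimony on binary presence/absence data. It is an illustration under stated assumptions, not the authors' RUBY implementation: the tree is a hypothetical `{parent: [children]}` mapping, leaf contents are sets of domain identifiers, and all function names are invented for this example.

```python
from collections import Counter


def dollo_reconstruct(tree, leaf_content, root):
    """Infer domain content at every node under Dollo parsimony.

    tree: {parent: [children]} (leaves have no entry); leaf_content:
    {leaf: set of domains}. A domain is present at a node if the node lies
    on a path between the single gain point (the last common ancestor of
    all leaves carrying the domain) and a leaf that carries it.
    """
    below = {}  # domains seen in any leaf at or under each node

    def collect(node):
        children = tree.get(node, [])
        if not children:
            below[node] = set(leaf_content[node])
        else:
            below[node] = set().union(*(collect(c) for c in children))
        return below[node]

    collect(root)
    state = {}

    def assign(node, parent_state):
        children = tree.get(node, [])
        if not children:
            state[node] = set(below[node])
        else:
            counts = Counter(d for c in children for d in below[c])
            # Present if inherited from the parent and still needed below,
            # or if at least two child subtrees carry it (gain point).
            state[node] = {
                d for d in below[node] if d in parent_state or counts[d] >= 2
            }
        for c in children:
            assign(c, state[node])

    assign(root, set())
    return state


def branch_events(tree, state):
    """Return {(parent, child): (gained, lost)} for every branch."""
    events = {}
    for parent, children in tree.items():
        for child in children:
            gained = state[child] - state[parent]
            lost = state[parent] - state[child]
            events[(parent, child)] = (gained, lost)
    return events
```

In this sketch each domain is gained on exactly one branch (or is already present at the root), and every absence below the gain point is explained by a loss, which matches the two conditions for loss listed above.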

Fig. 1. Domain loss, gain, and emergence across 20 species of pancrustacea. (a) Domain gain (squares) and loss (crosses) against branch length.
Ancestral domain content was reconstructed using a parsimony-based approach. Events were inferred along each branch of the tree. Domain loss
correlates well with branch length (Pearson r = 0.808, P ≪ 0.001). (b) Domain loss and emergence along branches. Nonclassified common
ancestors are labeled A-H. Line strength corresponds to rate of domain loss per My along the respective branch. Domain emergence is indicated
by green circles scaled to the number of emergence events along the respective branch (see also Table 1). Tree and approximate divergence times
are based on Honeybee Genome Sequencing Consortium (2006) and Hedges et al. (2006).

After ancestral reconstruction, we measured domain gain
and loss events along each branch in the tree. Two correction
steps were undertaken to distinguish between domain
"gain" events (where domains that can be found
outside of pancrustacea are gained along a branch within
the pancrustacean clade) and domain "emergence" events
(where domains are gained that are only found "within"
the pancrustacean clade). First, we only considered domains
that are gained within pancrustacea and discarded those
domains that were gained at ancestral nodes of pancrustacea
and either of the outgroups. Of the initial 11,735 domain
gain events in the whole tree including outgroups
(4,987 Pfam-A; 6,748 Pfam-B), a total of 8,492 (4,558 Pfam-A;
3,934 Pfam-B) domains are either ancient, that is, are
shared by at least one pancrustacean species and one outgroup
species, or are gained along a branch to an outgroup.
In both cases, the domains in question cannot be specific to
the pancrustacean clade. Next, we constructed a database
containing the hidden Markov models of all the remaining
3,243 domains that are gained within pancrustacea and
used HMMER 3.0 to scan these models against a sequence
database consisting of NCBI's NR and Integr8 (Kersey et al.
2005); gained domains with hits to sequences of species outside
of the pancrustacean clade were removed, yielding a
set of 30 (29 Pfam-A and 1 Pfam-B) domains that emerge
within pancrustacea.
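
The two correction steps amount to two successive set filters, sketched below in Python. The inputs are hypothetical: `gain_node` maps each domain to the node at which it was gained, `clade_nodes` is the set of nodes inside pancrustacea, and `outside_hits` stands in for the result of the database scan (domains with at least one hit outside the clade). None of these names come from the original pipeline.

```python
def classify_gained_domains(gain_node, clade_nodes, outside_hits):
    """Split gained domains into 'gained within the clade' and 'emerging'.

    Step 1 keeps only domains whose gain point lies inside the clade.
    Step 2 removes domains that also hit sequences outside the clade.
    """
    gained_in_clade = {d for d, node in gain_node.items() if node in clade_nodes}
    emerging = {d for d in gained_in_clade if d not in outside_hits}
    return gained_in_clade, emerging


# Bookkeeping check using the counts reported in the text.
TOTAL_GAINS = 11_735
ANCIENT_OR_OUTGROUP = 8_492
REMAINING = TOTAL_GAINS - ANCIENT_OR_OUTGROUP
```

The subtraction at the end simply confirms that the numbers reported above are mutually consistent (11,735 minus 8,492 leaves the 3,243 domains that enter the second filter).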

Emergence Bins and Disorder in Emerging Domains 
Emerging domains were grouped into three bins according 
to their age. The OLD bin contains domains that emerge 
at the root of the tree 430 Ma up until the diptera node, 
225 Ma, and spans ~200 My. The RECENT bin spans 185
My from diptera to Drosophila, the last common ancestor 
of all Drosophila species. The NEW bin incorporates all domains
that are younger than 40 My (see fig. 1). We also
constructed an ANCIENT bin, which contains domains that
likely emerged before our root node. We did not ensure that 
domains from the ANCIENT bin actually emerge at ancestral 
nodes; we required a set of domains that are gained before 
pancrustacea. Such domains have a hit in at least one of the 
outgroups and one pancrustacean species and hence must 
be considerably older than 430 My. For disorder prediction, 
we randomly chose 100 domains from the ANCIENT bin
while maintaining the fraction of Pfam-A and Pfam-B do- 
mains within the selection. Finally, we created a RANDOM 
bin containing 100 randomly selected domains, irrespective 
of the time point of their emergence. 
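
As a rough illustration of the binning and of the sampling step, the sketch below assigns a bin from the age of the node at which a domain emerges and draws a sample that preserves the Pfam-A/Pfam-B proportions. The node ages are taken from the text; assigning a boundary age to the older bin, the function names, and the proportional rounding are assumptions made for this example.

```python
import random

ROOT_AGE_MY = 430        # root of the tree (from the text)
DIPTERA_AGE_MY = 225     # diptera node (from the text)
DROSOPHILA_AGE_MY = 40   # last common ancestor of Drosophila (from the text)


def age_bin(emergence_age_my):
    """Map an emergence age in My to a bin name.

    Assumption: an age that falls exactly on a boundary node is assigned
    to the older of the two adjacent bins.
    """
    if emergence_age_my > ROOT_AGE_MY:
        return "ANCIENT"
    if emergence_age_my >= DIPTERA_AGE_MY:
        return "OLD"
    if emergence_age_my >= DROSOPHILA_AGE_MY:
        return "RECENT"
    return "NEW"


def stratified_sample(domains_by_source, n, seed=0):
    """Draw about n domains while keeping the per-source proportions.

    domains_by_source: {"Pfam-A": set(...), "Pfam-B": set(...)}.
    Rounding may make the total differ from n by one in unlucky cases.
    """
    total = sum(len(v) for v in domains_by_source.values())
    rng = random.Random(seed)
    sample = []
    for _source, domains in sorted(domains_by_source.items()):
        k = round(n * len(domains) / total)
        sample.extend(rng.sample(sorted(domains), k))
    return sample
```

A fixed seed is used here only so that the illustration is reproducible; the text does not state how the random selection was seeded.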

Domain Arrangements with Emerging Domains 
A domain arrangement is defined as the linear combina- 
tion of domains in a protein. To avoid overestimating the 
number of unique arrangements an emerging domain can 
be found in, we collapsed repeats to a single instance as 
copy number variation in repeats can occur between even 
closely related species (Ekman et al. 2007). Our analysis 
pipeline utilizes both custom implementations and exist- 
ing software. The pipeline consists of software for domain 
annotation, RUBY libraries for managing domain annota- 
tion and ancestral domain contents reconstruction, and 
software for assessing and visualizing overrepresentation of 
gene ontology (GO) terms. A description of the pipeline
with links can be found online at
http://iebservices.uni-muenster.de/radmoore/emergence.
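
The repeat-collapsing step described above can be sketched in a few lines of Python. The representation of an arrangement as an ordered list of domain identifiers and the function names are assumptions for illustration; the original pipeline used RUBY libraries.

```python
from itertools import groupby


def collapse_repeats(arrangement):
    """Collapse runs of the same domain into a single instance.

    Only consecutive repeats are merged; a domain that reappears after a
    different domain is kept, so the linear order is preserved.
    """
    return tuple(domain for domain, _run in groupby(arrangement))


def unique_arrangements(arrangements, domain):
    """Count distinct collapsed arrangements that contain `domain`."""
    collapsed = {collapse_repeats(a) for a in arrangements}
    return len({a for a in collapsed if domain in a})
```

For example, two proteins that differ only in the number of tandem copies of a repeat domain collapse to the same arrangement and are counted once.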

Functional Analysis of Emerging Domains 
To analyze the functional impact of domain gains, we con- 
ducted an overrepresentation analysis of GO (Reference 
Genome Group of the Gene Ontology Consortium 2009)
terms. As only 6 of the 30 emerging domains are directly 
annotated with GO terms, we employed a, to our knowl- 
edge, novel indirect GO analysis. First, we annotated all 
20 proteomes using Blast2GO (Conesa and Gotz 2008) 
with default settings. We then extracted all proteins that 
contain an emerging domain (1,291). Using the entire functional
annotation of pancrustacean proteins as universe, we
sought to find functional terms that are associated with domain
emergence using R and Bioconductor's TopGO (Alexa
et al. 2006) package. We used the weighted algorithm of 
TopGO, which eliminates local similarities and dependen- 
cies between GO terms by utilizing the topology of the 
GO graph during the analysis. After correction for multiple 
testing using Bonferroni, we found 43 significantly overrepresented
terms in the ontology biological_process
and 6 in the ontology molecular_function. Inspired
by sequence logos, which are frequently used to represent 
the frequency of a nucleotide or amino acid in an align- 
ment column, we visualized the significant terms from the 
biological_process ontology using a tag cloud-like 
representation, which we call a TermLogo (see fig. 3). Tag 
clouds typically represent the importance of a given word 
or phrase within a text document by scaling them accord- 
ing to their frequency. We used a tag cloud representation 
of the GO terms and transformed the P value obtained from
the TopGO analysis using a scaling factor $\tau$ defined as

$$\tau = \left| \log_{10}(P) \right|$$

such that the size of the font within the cloud does not rep- 
resent term frequency but the significance of the respec- 
tive term in the overrepresentation analysis. Hence, in our 
TermLogo, the larger the font, the smaller the associated P 
value. 
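
The scaling can be illustrated with a short Python sketch: a Bonferroni correction, the transformation of a P value into the absolute value of its base-10 logarithm, and a mapping of that score onto font sizes. The linear mapping to point sizes and all function names are assumptions for illustration; the text only specifies that the font size reflects the transformed P value.

```python
import math


def bonferroni(pvalues):
    """Bonferroni-adjust a {term: p} mapping (capped at 1.0)."""
    m = len(pvalues)
    return {term: min(1.0, p * m) for term, p in pvalues.items()}


def tau(p):
    """Scaling factor |log10(P)|; requires p > 0."""
    return abs(math.log10(p))


def font_sizes(pvalues, min_pt=8.0, max_pt=40.0):
    """Map each term to a font size that grows as its P value shrinks.

    The linear interpolation between min_pt and max_pt is a hypothetical
    choice made for this sketch.
    """
    scores = {term: tau(p) for term, p in pvalues.items()}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all scores are equal
    return {
        term: min_pt + (score - lo) / span * (max_pt - min_pt)
        for term, score in scores.items()
    }
```

Because the transformation is logarithmic, a term with P = 10⁻⁶ receives twice the score of a term with P = 10⁻³, so highly significant terms dominate the logo without dwarfing the rest.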

Results and Discussion 

Rates of Domain Loss, Gain, and Emergence 
We annotated the proteomes of 20 pancrustacean species 
and two outgroups using Pfam-A and Pfam-B (Finn et al. 
2010) and reconstructed the ancestral domain content at 
each node of the species tree using a parsimony-based 
approach (see Methods). We then measured, along each 
branch of the tree, the number of gained, lost, and novel do- 
mains. The results are summarized in figure 1 and table 1. 

A domain is considered to be lost at a node if it does 
not occur in any of its child nodes and gained if absent 
at the node's parent (which follows a well-established approach;
see also Fong et al. 2007; Rogers et al. 2010; Zmasek
and Godzik 2011). A domain that is both gained within 
and taxonomically restricted to the pancrustacean clade is 
considered a novel "emerging" domain (see Methods). Domain
loss rates correlate well with branch length (see fig.
1a and supplementary table 1, Supplementary Material online)
but are lineage dependent. In total, there are 5,375
loss events within the Drosophila clade (1,313 Pfam-A and 
4,062 Pfam-B), with an average loss rate of 3.41 ± 0.31 domains
per My along Drosophila lineages. In comparison, the
non-Drosophila lineages within the pancrustacean clade see
a total of 10,818 loss events (3,180 Pfam-A and 7,638 Pfam-B)
and exhibit an average loss rate of 4.43 ± 0.84 domains per
My. The highest loss rates within pancrustacea can be found 
along short branches within the Drosophila clade, in par- 
ticular within the subtrees of the melanogaster subgroup 
and obscura group. This is in line with the previous studies 
focusing on gene family turnover rates (Hahn et al. 2007). 
For many of the lost domains, multiple instances can be 
found in sister taxa. The TB domain (PF00683), for exam- 
ple, is found in fibrillins and Transforming Growth Factor- 
binding proteins and is localized in the extracellular matrix. 
The TB domain is likely quite old; instances can be found in 
the outgroup H. sapiens and in the pancrustacea D. pulex, 
B. mori, A. mellifera, and T. castaneum. TB seems to have 
been lost along the branches to N. vitripennis and the last 
common ancestor of lepidoptera and diptera; it cannot be 
found within the Drosophila clade. By loosening the E
value threshold to 0.1, weak traces of TB can be found in 
N. vitripennis and some Drosophila species suggesting either 
ectopic decay at the sequence level or functional divergence 
beyond detection by the current model. 

The average domain gain rate along all pancrustacean 
lineages is 1.9 ± 0.84 events per My. In comparison, the 
Drosophila lineages exhibit an average domain gain rate of 
4 ± 0.03 per My. It should be noted that inferred gain and 
loss rates are partially dependent on the chosen E value cutoff
used during initial domain annotation. A domain may
diverge beyond detection, either as the result of functional
divergence or as the result of mutations that render it nonfunctional.
If the E value cutoff used for detecting domains
is relaxed, domains previously absent may become visible
to our analysis. Supplementary figure 2 (Supplementary Material
online) illustrates the effect of different thresholds
on gain and loss rates. It demonstrates that domain loss is
particularly sensitive to variation in E value threshold; loss
rates decrease with more stringent cutoffs, likely as the total 
number of detected domains decreases. Domain gain is less 
affected as gain is restricted under Dollo's law. To ensure ro- 
bust rate estimation, we chose the model-defined gather- 
ing threshold for Pfam-A to minimize the number of falsely 
annotated domains. For Pfam-B, we chose a cutoff of 10⁻³
that offers a fair balance between sensitivity and selectivity 
(Ekman et al. 2007). 

Among the ~3,000 domains gained across the whole
pancrustacean tree, a tiny fraction of only 30 domains are 
evolutionarily novel, that is, they are not detectable any- 
where outside of pancrustacea (see fig. 1b and table 1). In 
non-Drosophila arthropods, these novel emerging domains 
account for 0.02 of the approximately two domains gained
per My. The Drosophila clade features the largest number of 
emerging domains with more than 50% of all events dated 
to Drosophila or a descendant node of Drosophila. Within 
the Drosophila clade, the average emergence rate is roughly 
0.06 domains per My. Ergo, the Drosophila lineages see a 
3-fold increase in domain emergence in comparison to the 
rest of the pancrustacean species. Since emergent domains
are a potential resource of evolutionary innovation, we draw
attention to their possible origins, evolutionary dynamics,
molecular properties, and adaptive potential.

Table 1. Domains Emerging Within the Pancrustacean Clade.
Note: Nodes at which a domain emerges are labeled as in fig. 1; meLgrp and meLsubgrp represent the melanogaster group and subgroup, respectively, and Agam represents A. gambiae. P(d)
denotes the prevalence of emerging domains (see text). The remaining columns give the total number of domain instances after resolving overlaps, the maximum
count of the emerging domain d in any one proteome, the average count of the emerging domain, the number of unique arrangements with
the emerging domain, and the number of co-occurring domains. The bin average is indicated below each bin section, with standard deviation indicated in
parentheses. Average properties of ANCIENT domains, while not emergent, are indicated for comparison.

Radiation of Emerging Domains 

After domains first emerge, they may spread rapidly among 
all descendants or remain invariant along some lineages 
while expanding along others. For each emerged domain, we 
extracted all instances and examined their properties in the 
extant species. 

The 30 domains that emerge within the pancrustacean 
clade affect a total of 1,291 proteins (~0.36% of all proteins),
in which they are either fused to other domains or form single-domain
proteins. The distribution of domains in proteins affected 
by emerging domains suggests that older domains have 
more cooccurring domains within arrangements, whereas 
younger domains more likely form single-domain proteins. 
In order to estimate the "evolutionary success" of domains
after they emerge, we calculated the prevalence P of a domain
d, defined as $P(d) = n_d / n_N$, where $n_d$ is the number
of child nodes that contain d and $n_N$ the total number of
leaves a given node has. Domains in the ANCIENT bin, that is,
those that are older than 430 My, have the
lowest average prevalence and the strongest deviation, with
0.4 ± 0.5 (see table 1).
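
A minimal Python sketch of the prevalence measure follows. It assumes that "child nodes that contain d" refers to the extant species (leaves) below the node at which the domain emerges, so that prevalence is the fraction of descendant leaves carrying the domain; the tree encoding and function names are hypothetical.

```python
def leaves_below(tree, node):
    """Return all leaves in the subtree rooted at `node`.

    tree: {parent: [children]}; a node without an entry is a leaf.
    """
    children = tree.get(node, [])
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves_below(tree, child))
    return result


def prevalence(tree, node, leaf_content, domain):
    """Fraction of leaves below `node` whose proteome contains `domain`."""
    leaves = leaves_below(tree, node)
    n_d = sum(1 for leaf in leaves if domain in leaf_content[leaf])
    return n_d / len(leaves)
```

Under this reading, a prevalence of 1 means the domain has been retained in every descendant species, whereas values near 0 indicate that it was lost along most lineages after it emerged.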

Roughly 80% of domains that emerge in the OLD bin
form multidomain proteins with an average of approximately
seven neighbors per protein. In contrast, only
roughly 50% of the domains in the RECENT bin form multidomain
proteins, and these have fewer co-occurring domains,
with only ~1.3 neighbors on average. Domains that
have recently emerged and are younger than 40 My
mostly form single-domain proteins, with only one-sixth of
the emerging domains found in multidomain proteins.

If novel domains are the result of recruitment from noncoding
regions, they might display a higher content of
residues in disorder than, for example, ancient domains;
recent evidence indicates that disorder may be evolutionarily
difficult to maintain (Schaefer et al. 2010) and that
gained domains contain a high proportion of disorder
(Buljan et al. 2010). We extracted all sequences of emerging
domains from extant species and calculated the proportion
of disorder in the sequence using VSL2 (Peng et al. 2006).
We indeed find that domains that emerge within pancrustacea
show a significantly higher proportion of disorder than
ancient domains (Kruskal-Wallis P ≪ 0.001, see also supplementary
fig. 1, Supplementary Material online). There
were, however, no conclusive differences between the age
bins (data not shown), which may be due to the small sample
size. Furthermore, no significant differences between
domains in age bins could be found with respect to average
sequence length of domains and average sequence similarities
between instances of a domain (data not shown).

Fig. 2. Arrangements of OMB domains in three species of Drosophila. Domains are represented as shapes; OMB is shown as oval box. The E value
cutoff for the presented arrangements is ~0.01. (a) Drosophila melanogaster has two different arrangements with OMB, one of which includes
the T-box domain (arrow-shaped polygon, member of Pfam clan CL0073). The majority of species share the latter arrangement. (b) Drosophila
virilis has a slightly different morphology and has three arrangements with OMB, where one instance is found in a region of domain overlap. (c)
Drosophila grimshawi exhibits a strikingly different morphology and has, as the only species of Drosophila, a total of five arrangements that contain
traces of OMB where it co-occurs or overlaps with domains that have been implicated in growth, development, and transcriptional regulation.

Finally, we looked into the position of emergent domains 
within the D. melanogaster genome, as it has, to date, the 
most complete assembly. The majority of D. melanogaster 
chromosomes harbor fewer than 1% emergent domains, with
two exceptions. On the X chromosome, 2% of the domains
(72 of 3,738) are emergent; on the 3L chromosome, 1.5% of the
domains (67 of 4,327) are emergent. Although insufficient
for statistical inference, this could hint that novel domains 
result from increased evolutionary rates on the X chromo- 
some, for which some evidence has been obtained (Baines 
et al. 2008). 

Functional Impact of Novel Domains 
Recently emerging domains are, by definition, restricted to 
a relatively small clade and therefore not widely distributed. 
Accordingly, they are not always functionally and struc- 
turally well characterized. Twenty-nine of the 30 emerging 
domains are Pfam-A, 20 of which have been previously characterized.
Only 6 of the 20 Pfam-A domains are functionally
classified by the GO (Reference Genome Group of the Gene
Ontology Consortium 2009). Five of the emerging domains 
are defined as "Domain of unknown function" (DUF), and 
only one of the emerging domains (DUF1074) is a member 
of a Pfam clan. Nonetheless, some of the proteins that gain 
emergent domains have been studied extensively. 

The optomotor-blind (OMB) domain, for example, oc- 
curs N-terminal of the OMB protein that plays manifold 
regulatory roles in development (Pflugfelder 2009). The 
OMB domain frequently co-occurs with members of the 
T-box family, an ancient family of transcriptional regula- 
tors thought to be a key player in animal development 
(Wilson and Conlon 2002). In D. melanogaster, the OMB 
domain has been identified as a key element in the es- 
tablishment of wing and abdominal pigmentation patterns 
(Brisson et al. 2004). Furthermore, the OMB proteins have 
been linked to a diverse array of morphological traits includ- 
ing structure of the head and external genitalia (Pflugfelder 
2009) and are thought to impact transcription of a num- 
ber of basal developmental genes such as tkv, mtv, i/g, 
and sal (del Alamo Rodriguez et al. 2004). Some of these 
genes are targets of decapentaplegic (dpp), a morphogen 
of prime importance in Drosophila development (Nellen 
et al. 1994). The OMB domain emerges along the branch 
of endopterygota and has subsequently been lost along 
some lineages while maintained along the others. By loosening
the E value threshold up to < 0.1, we can detect
traces of the OMB domain in all other child nodes of endopterygota,
with the exception of B. mori and A. gambiae.
Furthermore, we find additional copies in species that
already bear a copy of the OMB domain. For example, after
loosening the match requirements, we detect traces of
four additional copies of OMB within D. grimshawi, where
they occur in arrangements absent in all other pancrus- 
tacean species (see fig. 2). D. grimshawi is endemic to 
the island of Hawaii and is known for its strikingly differ- 
ent morphology in comparison to other Drosophila, in- 
cluding the diverse array of wing pigmentation patterns 
(Edwards et al. 2007). 

In order to globally assess the functional effect of domain 
emergence, and to overcome the weak links to GO cate- 
gories that emerging domains exhibit, we analyzed the GO 
annotations of proteins that recruited emergent domains 
using Blast2GO (Conesa and Gotz 2008). 

From the biological_process ontology, a strikingly 
high number of the statistically most significant terms cor- 
respond to environmental adaptation such as response to 
heat, drought, UV, and other abiotic stresses (see fig. 3 
and supplementary table 2, Supplementary Material on- 
line). This is followed by response to biotic stress and terms 
relating to sex differentiation and further to development 
and morphogenesis. 

The pancrustacean species considered here contain 
a number of highly specialized, geographically restricted 
species. D. sechellia, for example, inhabits an archipelago
of 115 islands in the Indian Ocean and feeds off a fruit found 
toxic to most other Drosophila species (Farine et al. 1996). 
Similarly, D. erecta and D. mojavensis are highly special- 
ized species with restricted geographic distributions (Singh 
et al. 2009). The Drosophila clade also contains cosmopoli- 
tan species such as D. melanogaster or D. simulans. Some 
Drosophila species find optimal conditions in high tem- 
perature areas, such as D. mojavensis, which is found in 
North American deserts where it feeds off rotting cactus, 
or species of the obscura group, which seek near-desert 
habitats during winter (Markow and O'Grady 2007). The 
differences among the Drosophila species affect courtship,
developmental time from egg to adult, as well as morphological
traits (see Markow and O'Grady (2007) and references
therein).

The protein functionalities affected by emerging domains 
reflect these differences and illustrate the diverse life history 
and the outstanding success of the pancrustacea, in partic- 
ular the cosmopolitan species of Drosophila, in adapting to 
new environments. 

Within the other two ontologies, we find the
cellular_component term extracellular_space,
as well as terms from the molecular_function ontology
related to DNA binding and transcriptional regulation,
to be overrepresented.

Conclusion 

Previous studies have estimated genome-wide gene 
turnover rates, that is, gene gain and loss, within the 
Drosophila clade (Hahn et al. 2007; Rogers et al. 2010). We 
find lower turnover rates for domains than for
genes. This is in line with our expectations since the average 
domain copy number across a given proteome is 4 ± 15. 
Accordingly, a gene gain or loss event will, on average, only 
affect few domains, many of which will retain copies in 
other genes. Ergo, although the copy number of domains 
will be subject to fluctuation, the presence or absence of 
domains, such as is considered here, will not be affected. A 
potentially confounding factor in the analysis of domain 
gain and loss is erroneous domain annotations. Accelerated 
rates of evolution or sequence bias in domain models 
may facilitate a signal of domain loss or shift the point of 
domain gain and hence influence emergence rates in our 
analysis. However, by using the model-defined gathering
thresholds for Pfam-A domains and a conservative cutoff 
for Pfam-B domains, we are confident that the trends in our 
analysis are robust. In particular, our results agree
with a previous study on gene family turnover in
Drosophila (Hahn et al. 2007): we similarly
find increased rates of loss and gain along branches to
the simulans/sechellia subclade, as well as along branches
within the obscura group.
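
The distinction drawn above between domain copy number and domain presence or absence can be made concrete with a small sketch. The gene names, domain labels, and helper functions below are hypothetical and serve only to illustrate the argument: losing a gene lowers copy numbers, but the repertoire changes only when the last copy of a domain disappears.

```python
from collections import Counter


def domain_copy_numbers(proteome):
    """Count how often each domain occurs across all genes of a proteome."""
    counts = Counter()
    for domains in proteome.values():
        counts.update(domains)
    return counts


def domain_repertoire(proteome):
    """Presence/absence view: the set of domains found at least once."""
    return set(domain_copy_numbers(proteome))


# Toy proteome (gene -> ordered domain arrangement), invented for illustration.
proteome = {
    "geneA": ["dom_kinase", "dom_SH2"],
    "geneB": ["dom_kinase"],
    "geneC": ["dom_SH2", "dom_zf"],
    "geneD": ["dom_zf", "dom_zf"],
}

# Simulate the loss of a single gene.
after_loss = {gene: doms for gene, doms in proteome.items() if gene != "geneA"}

print(domain_copy_numbers(proteome)["dom_kinase"])    # 2
print(domain_copy_numbers(after_loss)["dom_kinase"])  # 1
print(domain_repertoire(proteome) == domain_repertoire(after_loss))  # True
```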

Our results indicate that thousands of domains are lost 
along every lineage. High rates of domain loss seem to entail 
a strong loss of evolutionary potential for further innovation 
as de novo formation of novel functional domains is likely 
difficult (Bornberg-Bauer et al. 2010). How, then, can
this loss of evolutionary potential be compensated,
considering the need of species to adapt to an ever-changing
environment? First, depletion of the repertoire of functional
domains may be offset by the creation of new domain
arrangements. Over evolutionarily long timescales, domain
arrangements become longer and more diverse and assume 
new functions (Bjorklund et al. 2005; Ekman et al. 2005; 
Fong et al. 2007; Wang and Caetano-Anolles 2009). Second, 
new or strongly divergent proteins without any apparent 
homology even to closely related species (and accordingly 
without any domain assignment) can be found in any 
newly sequenced genome (see, e.g., Drosophila Genome 
Consortium 2007; Werren et al. 2010). Such orphan genes 
can make up to 30-40% of the gene repertoire and seem to 
be of particular importance for adaptation; their spatiotem- 
poral expression profiles can be very specific for tissues, 
developmental stages, and reproductive division of labor 
(Colbourne et al. 2011; Johnson and Tsutsui 2011). Third, 
as is shown in this study, the emergence of new domains is 
of great adaptive value and, accordingly, emerging domains 
spread rapidly across genomes. 

Finally, given the use of Dollo parsimony, loss rates should 
be considered an upper boundary (see also Methods). However,
given the comparatively shallow tree employed here,
results that are in agreement with studies that employed
an alternative model (Hahn et al. 2007), and similar signals
found among other taxonomic groups (Zmasek and Godzik
2011), we are confident that the overall trend should prevail.
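
Why Dollo parsimony yields an upper boundary for losses can be seen in a minimal sketch. Under the Dollo assumption a domain is gained exactly once, at the most recent common ancestor of all species that carry it, so every absence inside that clade must be explained by a loss. The tree topology, species labels, and function names below are hypothetical, and the code is an illustration of the principle rather than the implementation used in this study.

```python
def leaves(tree):
    """Return the set of leaf names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def dollo_events(tree, present):
    """Return (leaves of the gain clade, number of losses) for one domain."""
    present = set(present) & leaves(tree)
    if not present:
        return None, 0
    # Descend to the most recent common ancestor of all carriers.
    node = tree
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) != 1:
            break
        node = holders[0]

    def count_losses(subtree):
        # A maximal subtree without any carrier counts as a single loss.
        if not (leaves(subtree) & present):
            return 1
        if isinstance(subtree, str):
            return 0
        return sum(count_losses(child) for child in subtree)

    return frozenset(leaves(node)), count_losses(node)


# Hypothetical five-species tree, not the phylogeny analyzed here.
tree = ((("sp1", "sp2"), "sp3"), ("sp4", "sp5"))

print(dollo_events(tree, {"sp1", "sp3"})[1])  # 1 loss (in sp2)
print(dollo_events(tree, {"sp1", "sp4"})[1])  # 3 losses (sp2, sp3, sp5)
```

In the second call, a model that permits independent gains could explain the same pattern with two gains and no losses, which is why loss counts obtained under Dollo parsimony are best read as maxima.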

The emergence and rapid spread of novel domains are 
particularly striking. Domains emerge frequently in the con- 
text of abiotic stress, biotic defense, reproduction, and 
development. The former two categories have not been re- 
ported by studies focusing on gene families (Hahn et al. 
2007). A possible explanation is that domains affect only 
small parts of proteins and may thus be overlooked if they 
are incorporated in one protein out of many of a family. Fur- 
thermore, the rates of emergence reported here must be 
seen as a lower boundary. A novel domain can, almost by 
definition, not be reported by current bioinformatic tech- 
niques. Hidden Markov models (HMMs), a technique on 
which, for example, Pfam builds, first require several in- 
stances of a domain to build a profile. Accordingly, very re- 
cent domains that may still be strongly diverging or have 
just a single instance, for example, in orphan proteins, will 
be overlooked. 

The origin of new proteins remains generally elusive 
(Levine et al. 2006; Bornberg-Bauer et al. 2010) and only very
rarely can be accurately reconstructed. Here, it was found
that novel domains mostly form single-domain proteins
and are significantly enriched in disordered regions. Both
facts indicate that novel domains are either the result of 
de novo formation from DNA, possibly via intermediate 
RNA genes (Zhou et al. 2008), or structurally very flexible in 
choosing novel ligands or binding partners or both. In con- 
trast, older domains have more neighbors, form a larger vari- 
ety of arrangements, and less frequently form single-domain 
proteins than newer domains. This is in line with previous 
studies (Vogel et al. 2005) that indicate that the process of 
modular rearrangement is at least partly fueled by random 
attachment. 

To our knowledge, the study presented here is the first 
to date to assess the amount of domain gain, loss, and 
emergence within a dense and exceptionally well-studied 
clade. Furthermore, since potentially confounding effects 
such as whole-genome duplications are absent, the de- 
rived rates of loss and emergence will help set a frame- 
work to push further the limits of phylogenetic inferences 
and sequence comparison based on domain arrangements 
(Bjorklund et al. 2005; Yang et al. 2005; Fong et al. 2007; 
Song et al. 2008). The greater accuracy of HMMs in iden- 
tifying homologous sequences and the relatively low rates 
of domain turnover (as opposed to amino acid replacements)
help capture functional shifts at a rather coarse-grained
level and across evolutionarily long timescales of tens
to hundreds of My. The combination of indirect functional
inference of GO terms (by analyzing proteins that acquire
novel domains) and graphical representation of the statistical
analysis as illustrated here provides an intuitive representation
of adaptive signals. Accordingly, our method should
be applicable to most genome projects for which it offers 
a valuable complement to other more established meth- 
ods such as site-based statistical analysis or studies of gene 
families. 

Supplementary Material 

Supplementary figures 1-3 and tables 1 and 2 are 
available at Molecular Biology and Evolution online 
(http://www.mbe.oxfordjournals.org/). 